What AI Reveals About How We've Always Evaluated Evidence
As AI enters classrooms, clinics, and research workflows, it is revealing something unexpected—not new problems, but long-standing assumptions we never had to articulate. This essay explores what becomes visible when a non-human participant holds up a mirror to our evidence practices, and why that visibility may matter more than the technology itself.
Something curious is happening in our evidence systems. As AI tools quietly enter classrooms, clinics, and research workflows, they are not so much changing our decisions as changing what we can see about how those decisions are made. Long-settled practices now feel slightly unsettled—not because they were careless or wrong, but because they were built for a different kind of participant in the room.
Consider the debates around AI-assisted grading. When educators ask whether student work supported by AI still "counts" as evidence of learning, the discomfort is palpable. Yet the unease rarely stems from the technology itself. It comes from a deeper realization: many of our judgments about effort, understanding, and authorship have always relied on tacit assumptions that were never fully articulated—because we never had to articulate them.
That raises a simple but generative question: What long-standing assumptions about how humans evaluate evidence become newly visible when AI holds up a mirror to our reasoning?
The Long History of Trusting What We Cannot See
Long before AI entered our workflows, humans were deeply practiced in trusting what they could not directly observe. Evidence has always been a bridge between the visible and the inferred—between what can be measured and what must be concluded about causes, intentions, learning, or risk.
Greek philosophers debated how knowledge could be justified when certainty was elusive. Medieval scholars developed methods for weighing authority and plausibility under profound uncertainty. The early modern period brought instruments and experiments that transformed what counted as evidence—and required new conventions for trust. By the twentieth century, these conventions hardened into professional standards—statistical inference, controlled trials, peer review—designed to make unseen mechanisms legible and comparable across contexts.
What matters for the present moment is not whether these systems were right or wrong, but that they were constructed. They reflect historical choices about what could be measured, who could decide, and how uncertainty should be managed. AI does not arrive in a vacuum; it enters a landscape shaped by centuries of such choices. And as it does, it makes that long history newly noticeable—not by overturning it, but by gently revealing the constructed seams of our evidence practices.
What AI Makes Visible
As AI becomes a routine participant in evidence-making, many professionals describe a sense that their usual standards still matter, yet no longer feel quite sufficient. This reflects not a sudden loss of rigor, but a growing awareness that some of our most trusted judgments have always rested on assumptions rarely examined. Three stand out.
The assumption that work products reveal understanding. In education, we have long graded essays and exams as evidence of what students know. But submitted work was always a proxy for understanding—a stand-in we trusted because the link seemed reliable. A recent UNESCO article acknowledges that "many conventional evaluation methods have been poor proxies for meaningful learning all along" (UNESCO, 2025). AI did not create this gap; it exposed it. When human markers achieve only 57–64% accuracy in distinguishing AI-generated from student-written essays—barely better than chance—the proxy relationship can no longer be assumed (Fiedler & Döpke, 2025). Educators are now requiring drafts, revision histories, and metacognitive reflections, precisely because the old stand-in stopped working.
The assumption that experts integrate evidence without explaining how. Clinicians have always combined research findings, patient history, contextual cues, and experiential judgment into decisions. That integration was somewhat mysterious—even to the clinicians performing it—but it worked, so nobody demanded an explanation. Scholars noted years ago that evidence-based medicine "presupposes an inaccurate and deficient view of medical knowledge" because it cannot account for how tacit knowledge connects categories of understanding (Henry, 2006). Now AI systems offer their own diagnostic suggestions, and clinicians must decide when to trust them, override them, or blend them with their own judgment. The tacit skill is becoming explicit because it must be negotiated with a machine.
The assumption that following the method captures the intellectual work. In research synthesis—systematic reviews, meta-analyses—we have relied on documented procedures as proof of rigor: search terms, inclusion criteria, PRISMA flowcharts. But the judgment involved in selecting, weighting, and interpreting sources was never fully captured by the protocol. Scholars have long acknowledged that when these procedures and judgment calls go unrecorded, there is unlikely to be a trail of the decisions made for others to judge (Kale et al., 2019). Now that AI can execute protocols rapidly, we are forced to specify what the human adds. Frameworks like RAISE (Responsible AI in Evidence Synthesis) represent explicit acknowledgment that the human contribution was always essential but never fully articulated (Flemyng et al., 2025; Thomas et al., 2025).
We trusted that essays proved understanding, that expert judgment needed no explanation, and that documented methods guaranteed rigor. Each assumption served us—until it had to be explained to a machine.
The common thread: AI does not introduce fragility into our evidence systems; it illuminates fragilities that were already there. These assumptions were always present, quietly doing their work. AI is the new participant in the room that does not necessarily share them—so now they must be spoken aloud.
The Disciplines That Got Us Here
The sense of strain around evidence today did not arise from neglect. It is the product of sustained work across multiple disciplines, each grappling—often independently—with the same challenge: how humans justify belief when certainty is out of reach.
Philosophical traditions have explored what it means to claim something is known, warranted, or trustworthy. Cognitive science and behavioral economics have examined how people reason—revealing systematic patterns in judgment under uncertainty while showing that heuristics are often adaptive. Historical epistemology has shown that evidence practices are inseparable from their material, institutional, and historical contexts. What counts as objectivity, rigor, or proof has never been fixed.
This body of work is remarkably rich. What has been harder to achieve is integration: a way of holding these insights together when evidence is produced and interpreted jointly by humans and AI systems. Not because the disciplines fell short, but because such jointly produced evidence spans boundaries that no single discipline was designed to cross.
An Invitation, Not a Verdict
If this moment feels unsettled, that may be because it is asking something different of us—not answers, but attention. The question running beneath these pages is deliberately open: What long-standing assumptions about how humans evaluate evidence become newly visible when AI holds up a mirror to our reasoning?
Beneath these three assumptions lies a more unsettling recognition. Our evidence practices have long relied on "understanding"—as if we knew what it meant. The dominant view in education defines understanding as "the ability to think and act flexibly with what one knows" (Perkins, 1998); yet subsequent research shows that performance is "an unreliable index" of underlying learning (Soderstrom & Bjork, 2015). That definition does not distinguish between genuine comprehension and sophisticated pattern-matching. AI did not create this gap. It made it visible—and urgent.
Uncertainty here is not a weakness. It is a signal that we are encountering the edges of frameworks that have served us well, but were never designed for evidence produced, filtered, and interpreted in partnership with machines. Acknowledging that limit does not diminish the work that came before; it honors it by taking it seriously enough to ask what still holds.
Questions about evidence are not abstract puzzles. They surface wherever real decisions are made—by teachers assessing learning, clinicians weighing risk, researchers interpreting results, and leaders acting under uncertainty. In each of these settings, evidence shapes consequences that matter to people's lives.
What AI brings into view is not a new set of stakes, but a clearer view of how much those stakes have always depended on interpretation, judgment, and trust. Recognizing that does not weaken our standards; it sharpens our responsibility to understand them.
The work ahead is not about settling debates or declaring new rules. It is about learning to see our evidence practices more clearly, together, at the moment when clarity matters most.
This essay offers no framework—not because none is possible, but because the first step is seeing clearly what has been invisible. What comes next is not repair but articulation: making legible what has always been present but never needed to be said.
What would it mean to make the invisible scaffolding of your own evidence judgments visible—and who would you want in the room when you did?
Note on authorship and method: This essay was developed through a deliberate hybrid workflow in which I used a large language model as an interactive drafting and reasoning aid. I directed all prompts, curated and edited the text, verified claims, and take full responsibility for the final content.
References
Fiedler, A., & Döpke, J. (2025). Do humans identify AI-generated text better than machines? International Review of Economics Education, 49, Article 100321. https://doi.org/10.1016/j.iree.2025.100321
Flemyng, E., Noel-Storr, A., Macura, B., Gartlehner, G., Thomas, J., Meerpohl, J. J., Jordan, Z., Jemioło, P., et al. (2025). Position statement on artificial intelligence (AI) use in evidence synthesis across Cochrane, the Campbell Collaboration, JBI and the Collaboration for Environmental Evidence. Environmental Evidence, 14, Article 20. https://doi.org/10.1186/s13750-025-00374-5
Henry, S. G. (2006). Recognizing tacit knowledge in medical epistemology. Theoretical Medicine and Bioethics, 27(3), 187–213.
Kale, A., Kay, M., & Hullman, J. (2019). Decision-making under uncertainty in research synthesis: Designing for the garden of forking paths. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–14). ACM. https://doi.org/10.1145/3290605.3300432
Perkins, D. N. (1998). What is understanding? In M. S. Wiske (Ed.), Teaching for understanding: Linking research with practice (pp. 39–57). Jossey-Bass.
Soderstrom, N. C., & Bjork, R. A. (2015). Learning versus performance: An integrative review. Perspectives on Psychological Science, 10(2), 176–199.
Thomas, J., Flemyng, E., Noel-Storr, A., Moy, W., Marshall, I. J., Hajji, R., Jordan, Z., Macura, B., et al. (2025). Responsible AI in Evidence SynthEsis (RAISE): Guidance and recommendations (Version 2). Open Science Framework. https://osf.io/fwaud/
UNESCO. (2025). What's worth measuring? The future of assessment in the AI age. https://www.unesco.org/en/articles/whats-worth-measuring-future-assessment-ai-age